Hierarchical Clustering in Medical Document Collections: the BIC-Means Method

Authors

  • Nikos Hourdakis
  • Michalis Argyriou
  • Euripides G. M. Petrakis
  • Evangelos E. Milios
Abstract

Hierarchical clustering of text collections is a key problem in document management and retrieval. In partitional hierarchical clustering, which is more efficient than its agglomerative counterpart, the entire collection is split into clusters, and the individual clusters are split further until a heuristically motivated termination criterion is met. In this paper, we define the BIC-means algorithm, which applies the Bayesian Information Criterion (BIC) as a domain-independent termination criterion for partitional hierarchical clustering. We evaluate the effectiveness of BIC-means in clustering and retrieval on medical document collections, and we propose a dynamic version of the BIC-means algorithm that adapts an existing clustering solution to document additions.
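The abstract gives no formulas, so the following is only a minimal sketch of the general idea: split a cluster in two, and keep the split only if it improves the BIC. Everything specific here is an assumption rather than the authors' method: the spherical-Gaussian likelihood with one shared variance, the parameter count, the deterministic farthest-point seeding of 2-means, the `min_size` guard, and the helper names `bic`, `two_means`, and `bic_means`. Toy 2-D points stand in for document vectors.

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def bic(clusters):
    """BIC (higher is better) of a hard partition.
    Assumed model: spherical Gaussians sharing one variance."""
    n = sum(len(c) for c in clusters)
    d = len(clusters[0][0])
    k = len(clusters)
    sse = 0.0
    for c in clusters:
        mu = centroid(c)
        sse += sum(sq_dist(p, mu) for p in c)
    var = max(sse / (n * d), 1e-12)  # floor avoids log(0) on identical points
    loglik = sum(len(c) * math.log(len(c) / n) for c in clusters)
    loglik -= 0.5 * n * d * (math.log(2 * math.pi * var) + 1)
    n_params = k * d + (k - 1) + 1  # means + mixing weights + variance
    return loglik - 0.5 * n_params * math.log(n)

def two_means(points, max_iter=50):
    """2-means with deterministic farthest-point seeding (an assumption)."""
    mu = centroid(points)
    c0 = max(points, key=lambda p: sq_dist(p, mu))
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    centers = [list(c0), list(c1)]
    left, right = [], []
    for _ in range(max_iter):
        left = [p for p in points
                if sq_dist(p, centers[0]) <= sq_dist(p, centers[1])]
        right = [p for p in points
                 if sq_dist(p, centers[0]) > sq_dist(p, centers[1])]
        if not left or not right:
            break
        new_centers = [centroid(left), centroid(right)]
        if new_centers == centers:
            break
        centers = new_centers
    return left, right

def bic_means(points, min_size=4):
    """Recursively bisect; stop when a split does not raise the BIC.
    Returns the leaf clusters of the resulting hierarchy."""
    if len(points) < min_size:
        return [points]
    left, right = two_means(points)
    if not left or not right or bic([left, right]) <= bic([points]):
        return [points]
    return bic_means(left, min_size) + bic_means(right, min_size)

# Two well-separated 3x3 grids of toy points.
grid = [(x, y) for x in range(3) for y in range(3)]
docs = grid + [(x + 20, y + 20) for x, y in grid]
clusters = bic_means(docs)
print([len(c) for c in clusters])
```

On this toy input the first split separates the two grids, and neither grid is split again because bisecting it lowers the BIC. A real document collection would use high-dimensional term vectors and possibly a different similarity measure and likelihood model.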

Similar resources

Document Clustering Based on Ontology and a Fuzzy Approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, one of the important techniques of text mining, is the unsupervised classification of similar documents into different groups. The most important step...

Implementation of Hybrid Clustering Algorithm with Enhanced K-Means and Hierarchal Clustering

We propose a hybrid clustering method that combines the strengths of both partitioning and agglomerative clustering methods. Clustering algorithms that build meaningful hierarchies out of large document collections are ideal tools for their interactive visualization and exploration as they provide data-views that are consistent, predictable, and at different levels of granularit...

An Empirical Study of K-Means Initialization Methods for Document Clustering

Every day, vast amounts of documents, e-mails, and web pages are generated. In order to handle these data, automatic techniques such as document clustering are needed. The k-means method is a clustering technique widely used in practice because of its simplicity and empirical speed. In this paper, the basic k-means algorithm is augmented with two special initialization techniques that aim at impr...

Giving an Upperbound of the Number of Clusters and Relevant Words in Hierarchical Document Clustering Based on BIC

A new generative-model-based approach to automatic document clustering, using the BIC as the model selection criterion, is described. A new method based on a graphical model is proposed to give an upper bound on the numbers of clusters and relevant words. The result of an experiment using the NTCIR web data collection is briefly reported.

Exploiting parallelism to support scalable hierarchical clustering

A distributed memory parallel version of the group average Hierarchical Agglomerative Clustering algorithm is proposed to enable scaling the document clustering problem to large collections. Using standard message passing operations reduces interprocess communication while maintaining efficient load balancing. In a series of experiments using a subset of a standard TREC test collection, our par...

Journal:
  • JDIM

Volume 8

Publication year: 2010